fix: use sync.Cond to handle no-task blocking wait #299
Conversation
case <-tq.noTaskSignal:
}
tq.noTaskCond.L.Lock()
for tq.activeTasks > 0 {
I know this looks unnecessary, but my assumption here is that there are no guarantees as to who gets the mutex next when multiple parties are waiting on it, both via Wait() and via Lock(). So if a second task started and got the lock before our wait caller was woken up, we could find ourselves in a tq.activeTasks > 0 condition again even after waking up here.
At least that's how it would work in C++ and Java. Or do we have stronger guarantees in Go about who gets the lock next after a Broadcast()?
I think the loop is right - the sync.Cond.Wait docs use a loop as an example too.
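For reference, here's a minimal sketch of the wait-in-a-loop pattern the sync.Cond.Wait documentation recommends, written against the field names visible in the diff above (the surrounding type is an assumption for illustration, not the actual code in this PR):

```go
// Minimal sketch only: the taskQueue type below is assumed from the field
// names in the diff, not copied from the real implementation.
package taskqueue

import "sync"

type taskQueue struct {
	noTaskCond  *sync.Cond // noTaskCond.L guards activeTasks
	activeTasks int
}

// WaitForNoActiveTasks blocks until there are zero active tasks.
func (tq *taskQueue) WaitForNoActiveTasks() {
	tq.noTaskCond.L.Lock()
	for tq.activeTasks > 0 {
		// Wait atomically unlocks L and suspends this goroutine; it re-locks L
		// before returning. Another goroutine may have taken the lock (and
		// started a task) before we were rescheduled, so the condition has to
		// be re-checked, hence the loop.
		tq.noTaskCond.Wait()
	}
	tq.noTaskCond.L.Unlock()
}
```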
sync.Cond is actually what I was thinking about suggesting if you wanted to make this more flexible :)
Note that there's also sync.WaitGroup, but it's designed for a static pattern where you Add tasks up front and Wait afterwards. It may panic if you call Add after Wait, which we may do here. It's a very close use case, but not applicable.
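For contrast, the static pattern sync.WaitGroup is designed for looks like this (generic example, nothing from this repo):

```go
// Generic sync.WaitGroup usage: every Add happens before Wait starts.
// Calling Add concurrently with (or after) Wait is the misuse mentioned
// above and can panic.
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1) // counted up front, before Wait begins
		go func(n int) {
			defer wg.Done()
			fmt.Println("task", n, "done")
		}(i)
	}
	wg.Wait() // no further Add calls race with this Wait
}
```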
}
tq.noTaskCond.L.Unlock()
sync.Cond.Broadcast doesn't require holding the lock to be called, so I wonder if you could make this loop lock-free by also continuing to use sync/atomic for activeTasks.
I think that's ideally what we want, because otherwise we're adding some amount of lock contention on each worker. The locks are only held for tiny amounts of time here, but they still add their overhead, whereas atomics are comparatively very cheap.
In other words, I think we can just use sync.Cond for its broadcast feature, and continue using atomics for the counter. If our condition was more complex and we couldn't use atomics, then we'd need the shared lock for sure, but I don't think we absolutely need it here.
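A sketch of the idea being proposed here (type and method names are assumed for illustration; as the rest of this thread works out, dropping the lock on the broadcast side introduces a race, so treat this as illustrative only):

```go
// Sketch of the atomics-plus-broadcast idea. The taskQueue type and method
// names are assumptions; the thread below identifies a race in this
// lock-free variant.
package taskqueue

import (
	"sync"
	"sync/atomic"
)

type taskQueue struct {
	activeTasks int64 // updated via sync/atomic
	noTaskCond  *sync.Cond
}

// Worker side: no lock held around the decrement or the broadcast.
func (tq *taskQueue) taskDone() {
	if atomic.AddInt64(&tq.activeTasks, -1) == 0 {
		tq.noTaskCond.Broadcast() // legal without holding L, per the docs
	}
}

// Waiter side: the lock is still needed for Cond.Wait itself.
func (tq *taskQueue) WaitForNoActiveTasks() {
	tq.noTaskCond.L.Lock()
	for atomic.LoadInt64(&tq.activeTasks) > 0 {
		// A broadcast sent between the check above and this Wait is lost.
		tq.noTaskCond.Wait()
	}
	tq.noTaskCond.L.Unlock()
}
```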
Careful: if a sync.Cond.Broadcast() is issued and there isn't anyone Wait()ing to "hear" it, the broadcast is discarded. You effectively need at least an implicit "lock" (by some other means) to ensure that something is already in Wait() when the broadcast fires.
I guess this depends on when the Wait call happens. I assume we call Wait before the workers all go idle. If we may call Wait at a later time, then indeed we need something more. Perhaps Wait should first atomically check if we're already idle, and if so, return immediately. Otherwise, block until the next broadcast.
Wait, the current code already does this - it only Waits inside the loop after it checks that we're not already idle.
I think you're saying that replacing the lock with atomics on the broadcast side could add a race, like the following sequence of events:
- WaitForNoActiveTasks grabs the lock
- WaitForNoActiveTasks sees the counter is 1, so it enters the loop
- worker finishes a task, and broadcasts without grabbing the lock
- WaitForNoActiveTasks calls Cond.Wait, which doesn't see the broadcast as it is late
Whereas with Rod's current code:
- WaitForNoActiveTasks grabs the lock
- WaitForNoActiveTasks sees the counter is 1, so it enters the loop
- worker finishes a task, and tries to grab the lock to broadcast - blocking, as WaitForNoActiveTasks has the lock
- WaitForNoActiveTasks calls Cond.Wait, temporarily releasing the lock
- worker grabs the lock and broadcasts
- WaitForNoActiveTasks sees the broadcast and finishes
So I think you're right: we don't need to hold the lock to broadcast, like the docs say, but not doing so introduces a logic race.
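To make the difference concrete, here's a sketch of the worker-side decrement in that second sequence, where the broadcast happens under the shared lock (field and method names are assumed, not copied from the PR):

```go
// Sketch of the lock-held variant described above. Because the worker must
// take the same lock the waiter holds while checking the counter, the
// broadcast cannot land in the gap between the waiter's check and its
// Cond.Wait call.
package taskqueue

import "sync"

type taskQueue struct {
	noTaskCond  *sync.Cond // noTaskCond.L guards activeTasks
	activeTasks int
}

func (tq *taskQueue) taskDone() {
	tq.noTaskCond.L.Lock()
	tq.activeTasks--
	if tq.activeTasks == 0 {
		tq.noTaskCond.Broadcast()
	}
	tq.noTaskCond.L.Unlock()
}
```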
Yup! This is what I was trying to convey, sorry for being terse!
SGTM - as discussed in the thread, I don't think we can make this lock-free on the worker side.
LGTM
feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests (#284)
* feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests
* fixup! feat: add WorkerTaskQueue#WaitForNoActiveTasks() for tests

fix(responsemanager): fix flaky tests

fix(responsemanager): make fix more global

feat: add basic OT tracing for incoming requests
Closes: #271

docs(tests): document tracing test helper utilities

fix(test): increase 1s timeouts to 2s for slow CI (#289)
* fix(test): increase 1s timeouts to 2s for slow CI
* fixup! fix(test): increase 1s timeouts to 2s for slow CI

testutil/chaintypes: simplify maintenance of codegen (#294)
"go generate" now updates the generated code for us. The separate directory for a main package was unnecessary; a build-tag-ignored file is enough. Using gofmt on the resulting source is now unnecessary too, as upstream has been using go/format on its output for some time. Finally, re-generate the output source code, as the last time that was done we were on an older ipld-prime.

ipldutil: use chooser APIs from dagpb and basicnode (#292)
Saves us a bit of extra code, since they were added in summer. Also avoid making defaultVisitor a variable, which makes it clearer that it's never a nil func. While here, replace node/basic with node/basicnode, as the former has been deprecated in favor of the latter.
Co-authored-by: Hannah Howard <hannah@hannahhoward.net>

fix: use sync.Cond to handle no-task blocking wait (#299)
Ref: #284

Peer Stats function (#298)
* feat(graphsync): add impl method for peer stats, which gets current request states by request ID for a given peer
* fix(requestmanager): fix tested method

Add a bit of logging (#301)
* chore(responsemanager): add a bit of logging
* fix(responsemanager): remove code change

chore: short-circuit unnecessary message processing

Expose task queue diagnostics (#302)
* feat(impl): expose task queue diagnostics
* refactor(peerstate): put peerstate in its own module
* refactor(peerstate): make diagnostics return array
Based on feedback from #284, I think this is closer to what I want. While we currently don't have a use for more than one consumer, I want WaitForNoActiveTasks() to be a general, and potentially (or at least safely) multi-consumer, blocking call that doesn't return until there are zero active tasks.
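To illustrate the intended multi-consumer behaviour, here's a hypothetical, self-contained sketch. The only name taken from this PR is WaitForNoActiveTasks; the interface, the fake queue, and the helper are made up for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// noActiveTasksWaiter is a stand-in for the queue in this PR: anything with a
// blocking WaitForNoActiveTasks method fits. The interface itself is
// hypothetical, defined only for this sketch.
type noActiveTasksWaiter interface {
	WaitForNoActiveTasks()
}

// idleQueue is a trivial fake whose WaitForNoActiveTasks returns immediately,
// so the sketch runs on its own.
type idleQueue struct{}

func (idleQueue) WaitForNoActiveTasks() {}

// waitAllConsumers shows the intended behaviour: several goroutines block on
// WaitForNoActiveTasks concurrently, and all of them return once the queue
// reports zero active tasks.
func waitAllConsumers(tq noActiveTasksWaiter, consumers int) {
	var wg sync.WaitGroup
	for i := 0; i < consumers; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			tq.WaitForNoActiveTasks() // safe to call from multiple goroutines
			fmt.Println("consumer", n, "saw zero active tasks")
		}(i)
	}
	wg.Wait()
}

func main() {
	waitAllConsumers(idleQueue{}, 3)
}
```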